Goto

Collaborating Authors

 surface information


DS-ProGen: A Dual-Structure Deep Language Model for Functional Protein Design

arXiv.org Artificial Intelligence

Inverse Protein Folding (IPF) is a critical subtask in the field of protein design, aiming to engineer amino acid sequences capable of folding correctly into a specified three-dimensional (3D) conformation. Although substantial progress has been achieved in recent years, existing methods generally rely on either backbone coordinates or molecular surface features alone, which restricts their ability to fully capture the complex chemical and geometric constraints necessary for precise sequence prediction. To address this limitation, we present DS-ProGen, a dual-structure deep language model for functional protein design, which integrates both backbone geometry and surface-level representations. By incorporating backbone coordinates as well as surface chemical and geometric descriptors into a next-amino-acid prediction paradigm, DS-ProGen is able to generate functionally relevant and structurally stable sequences while satisfying both global and local conformational constraints. On the PRIDE dataset, DS-ProGen attains the current state-of-the-art recovery rate of 61.47%, demonstrating the synergistic advantage of multi-modal structural encoding in protein design. Furthermore, DS-ProGen excels in predicting interactions with a variety of biological partners, including ligands, ions, and RNA, confirming its robust functional retention capabilities.


Paved or unpaved? A Deep Learning derived Road Surface Global Dataset from Mapillary Street-View Imagery

arXiv.org Artificial Intelligence

We have released an open dataset with global coverage on road surface characteristics (paved or unpaved) derived utilising 105 million images from the world's largest crowdsourcing-based street view platform, Mapillary, leveraging state-of-the-art geospatial AI methods. We propose a hybrid deep learning approach which combines SWIN-Transformer based road surface prediction and CLIP-and-DL segmentation based thresholding for filtering of bad quality images. The road surface prediction results have been matched and integrated with OpenStreetMap (OSM) road geometries. This study provides global data insights derived from maps and statistics about spatial distribution of Mapillary coverage and road pavedness on a continent and countries scale, with rural and urban distinction. This dataset expands the availability of global road surface information by over 3 million kilometers, now representing approximately 36% of the total length of the global road network. Most regions showed moderate to high paved road coverage (60-80%), but significant gaps were noted in specific areas of Africa and Asia. Urban areas tend to have near-complete paved coverage, while rural regions display more variability. Model validation against OSM surface data achieved strong performance, with F1 scores for paved roads between 91-97% across continents. Taking forward the work of Mapillary and their contributors and enrichment of OSM road attributes, our work provides valuable insights for applications in urban planning, disaster routing, logistics optimisation and addresses various Sustainable Development Goals (SDGS): especially SDGs 1 (No poverty), 3 (Good health and well-being), 8 (Decent work and economic growth), 9 (Industry, Innovation and Infrastructure), 11 (Sustainable cities and communities), 12 (Responsible consumption and production), and 13 (Climate action).


Knowledge of Pretrained Language Models on Surface Information of Tokens

arXiv.org Artificial Intelligence

Do pretrained language models have knowledge regarding the surface information of tokens? We examined the surface information stored in word or subword embeddings acquired by pretrained language models from the perspectives of token length, substrings, and token constitution. Additionally, we evaluated the ability of models to generate knowledge regarding token surfaces. We focused on 12 pretrained language models that were mainly trained on English and Japanese corpora. Experimental results demonstrate that pretrained Figure 1: Input and output examples when asking GPT-language models have knowledge regarding token 3.5 Turbo about the surface information of words (as length and substrings but not token constitution. of 1st, Jan. 2024). The Japanese example has the same Additionally, the results imply that there meaning as the English text, asking the length of and is a bottleneck on the decoder side in terms of third character in 人類学者 (anthropologist).


Learning to Teach with Deep Interactions

arXiv.org Machine Learning

Machine teaching [42, 37] uses a meta/teacher model to guide the training of a student model (which will be used in real tasks) through training data selection, loss function design, etc. Previously, the teacher model only takes shallow/surface information as inputs (e.g., training iteration number, loss and accuracy from training/validation sets) while ignoring the internal states of the student model, which limits the potential of learning to teach. In this work, we propose an improved data teaching algorithm, where the teacher model deeply interacts with the student model by accessing its internal states. The teacher model is jointly trained with the student model using meta gradients propagated from a validation set. We conduct experiments on image classification with clean/noisy labels and empirically demonstrate that our algorithm makes significant improvement over previous data teaching methods.